Large-Scale Non-Linear Regression within the MapReduce Framework

Authors

Ahmed Khademzadeh

Philip Chan

Georgios C. Anagnostopoulos

Abstract

Large-scale Non-linear Regression within the MapReduce Framework. By: Ahmed Khademzadeh. Thesis Advisor: Philip Chan, Ph.D.

Regression models have many applications in real-world problems in fields such as finance, epidemiology, and environmental science. Big datasets are now everywhere, and bigger datasets help us construct better models from the data. The issue with big datasets is that they take a long time to process, or even to read, on a single machine. This research employs MapReduce to model large-scale non-linear regression problems in parallel. The MRRT (MapReduce Regression Tree) algorithm divides the feature space into overlapping subspaces and then shuffles each subspace's data items to a node in the cluster. Each node then constructs a regression tree for the subspace of the data it has received. Different versions of the algorithm (overlapping/non-overlapping subspaces and weighted/unweighted prediction using neighboring models) are proposed and compared with the regression tree (RT) algorithm implemented in Matlab libraries. Experiments on synthetic and real datasets show that MRRT, which is designed to be fast and scalable for the MapReduce framework, not only achieves close-to-linear speedup and close-to-optimum scalability, but also outperforms the RT algorithm in accuracy (in most cases) and improves prediction time by more than 80%. Although MRRT is designed for the MapReduce framework, it can also be used on a single machine; in that case it improves learning time by 60% (in most cases) compared to the RT algorithm and shows close-to-linear scalability, whereas the RT algorithm scales roughly quadratically.
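The partition-then-fit idea described in the abstract can be illustrated with a small single-process sketch. This is not the thesis implementation: the one-dimensional feature, the equal-width `edges`, the fixed `overlap` margin, the toy tree learner, and all function names (`map_phase`, `fit_tree`, `train`, `mrrt_predict`) are assumptions made for illustration, and only the unweighted, own-subspace prediction variant is shown.

```python
from collections import defaultdict

def map_phase(points, edges, overlap):
    """Map step (assumed form): emit (subspace_id, point) pairs.
    A point within `overlap` of a boundary is sent to both neighbours."""
    for x, y in points:
        for k in range(len(edges) - 1):
            if edges[k] - overlap <= x < edges[k + 1] + overlap:
                yield k, (x, y)

def sse(values):
    """Sum of squared errors around the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def fit_tree(data, min_leaf=5, max_depth=12):
    """Toy 1-D regression tree: a leaf is a float (mean of y),
    an inner node is a tuple (threshold, left_subtree, right_subtree)."""
    data = sorted(data)
    ys = [y for _, y in data]
    mean = sum(ys) / len(ys)
    if max_depth == 0 or len(data) < 2 * min_leaf:
        return mean
    best = None
    for i in range(min_leaf, len(data) - min_leaf + 1):
        if data[i - 1][0] == data[i][0]:
            continue  # cannot split between identical x values
        cost = sse(ys[:i]) + sse(ys[i:])
        if best is None or cost < best[0]:
            best = (cost, i)
    if best is None:
        return mean
    i = best[1]
    threshold = (data[i - 1][0] + data[i][0]) / 2
    return (threshold,
            fit_tree(data[:i], min_leaf, max_depth - 1),
            fit_tree(data[i:], min_leaf, max_depth - 1))

def predict(tree, x):
    """Walk from the root to a leaf and return its mean."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x < threshold else right
    return tree

def train(points, edges, overlap):
    """Shuffle + reduce step: group points by subspace, fit one tree each."""
    groups = defaultdict(list)
    for k, point in map_phase(points, edges, overlap):
        groups[k].append(point)
    return {k: fit_tree(group) for k, group in groups.items()}

def mrrt_predict(models, edges, x):
    """Unweighted prediction: use only the model that owns x's subspace."""
    k = 0
    while k < len(edges) - 2 and x >= edges[k + 1]:
        k += 1
    return predict(models[k], x)

if __name__ == "__main__":
    pts = [(i / 100, (i / 100) ** 2) for i in range(400)]  # y = x^2 on [0, 4)
    edges = [0, 1, 2, 3, 4]
    models = train(pts, edges, overlap=0.1)
    print(mrrt_predict(models, edges, 2.5))
```

In a real MapReduce job the `train` grouping would be done by the framework's shuffle, with each reducer fitting the tree for one subspace key. Setting `overlap=0` gives the non-overlapping variant mentioned in the abstract.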


Similar resources

Adaptive Dynamic Data Placement Algorithm for Hadoop in Heterogeneous Environments

The Hadoop MapReduce framework is an important distributed processing model for large-scale data-intensive applications. Current Hadoop, and the existing rack-aware data placement strategy of the Hadoop Distributed File System in a homogeneous Hadoop cluster, assume that each node in the cluster has the same computing capacity and that the same workload is assigned to each node. Default Hadoop d...


Robust Regression on MapReduce

Although the MapReduce framework is now the de facto standard for analyzing massive data sets, many algorithms (in particular, many iterative algorithms popular in machine learning, optimization, and linear algebra) are hard to fit into MapReduce. Consider, e.g., the $\ell_p$ regression problem: given a matrix $A \in \mathbb{R}^{m \times n}$ and a vector $b \in \mathbb{R}^{m}$, find a vector $x^* \in \mathbb{R}^{n}$ that minimizes $f(x) = \|Ax - b\|_p$. The widel...


Utilizing Kernel Adaptive Filters for Speech Enhancement within the ALE Framework

The performance of linear models, widely used within the framework of adaptive line enhancement (ALE), deteriorates dramatically in the presence of non-Gaussian noise. On the other hand, adaptive implementations of nonlinear models, e.g. Volterra filters, suffer from the severe problems of a large number of parameters and slow convergence. Nonetheless, kernel methods are emerging solutions t...


Parallel Implementation of Multiple Linear Regression Algorithm Based on MapReduce

Traditional business activities generate large amounts of data, creating data repositories ranging from terabytes to petabytes in size. However, this information cannot be practically analyzed on a single commodity computer because the data is too large to fit in memory. For this purpose, the large size of data to be processed requires the use of high-performance analytical systems running on ...


Parallel extreme learning machine for regression based on MapReduce

Regression is one of the most basic problems in data mining. For regression problems, the extreme learning machine (ELM) can achieve better generalization performance at a much faster learning speed. However, the growing volume of datasets makes regression by ELM on very large-scale datasets a challenging task. Through analyzing the mechanism of the ELM algorithm, an efficient parallel ELM for regression ...




Publication date: 2013
